Renders
knitr::opts_chunk$set(options(knitr.duplicate.label = "allow"))
rmarkdown::render("BCB420_A1_DimitrijeRatkov.Rmd")
##
##
## processing file: BCB420_A1_DimitrijeRatkov.Rmd
##
|
| | 0%
|
|... | 4%
## ordinary text without R code
##
##
|
|...... | 9%
## label: unnamed-chunk-6
##
|
|......... | 13%
## ordinary text without R code
##
##
|
|............ | 17%
## label: unnamed-chunk-7
##
|
|............... | 22%
## ordinary text without R code
##
##
|
|.................. | 26%
## label: unnamed-chunk-8
##
|
|..................... | 30%
## ordinary text without R code
##
##
|
|........................ | 35%
## label: unnamed-chunk-9
##
|
|........................... | 39%
## ordinary text without R code
##
##
|
|.............................. | 43%
## label: unnamed-chunk-10
##
|
|................................. | 48%
## ordinary text without R code
##
##
|
|..................................... | 52%
## label: unnamed-chunk-11
##
|
|........................................ | 57%
## ordinary text without R code
##
##
|
|........................................... | 61%
## label: unnamed-chunk-12
##
|
|.............................................. | 65%
## ordinary text without R code
##
##
|
|................................................. | 70%
## label: unnamed-chunk-13
##
|
|.................................................... | 74%
## ordinary text without R code
##
##
|
|....................................................... | 78%
## label: unnamed-chunk-14
##
|
|.......................................................... | 83%
## ordinary text without R code
##
##
|
|............................................................. | 87%
## label: unnamed-chunk-15
##
|
|................................................................ | 91%
## ordinary text without R code
##
##
|
|................................................................... | 96%
## label: unnamed-chunk-16
##
|
|......................................................................| 100%
## ordinary text without R code
## output file: BCB420_A1_DimitrijeRatkov.knit.md
## "C:/Program Files/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS BCB420_A1_DimitrijeRatkov.utf8.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash+smart --output BCB420_A1_DimitrijeRatkov.html --email-obfuscation none --self-contained --standalone --section-divs --template "D:\Documents\R\win-library\3.6\rmarkdown\rmd\h\default.html" --no-highlight --variable highlightjs=1 --variable "theme:bootstrap" --include-in-header "C:\Users\super\AppData\Local\Temp\RtmpkV6hfP\rmarkdown-str804c303644c4.html" --mathjax --variable "mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" --lua-filter "D:/Documents/R/win-library/3.6/rmarkdown/rmd/lua/pagebreak.lua" --lua-filter "D:/Documents/R/win-library/3.6/rmarkdown/rmd/lua/latex-div.lua"
##
## Output created: BCB420_A1_DimitrijeRatkov.html
rmarkdown::render("BCB420_A2S_DimitrijeRatkov.Rmd")
##
##
## processing file: BCB420_A2S_DimitrijeRatkov.Rmd
##
|
| | 0%
|
|.. | 3%
## ordinary text without R code
##
##
|
|.... | 6%
## label: unnamed-chunk-1-2
##
|
|...... | 9%
## ordinary text without R code
##
##
|
|........ | 11%
## label: unnamed-chunk-3 (with options)
## List of 1
## $ fig.cap: chr "1"
##
|
|.......... | 14%
## ordinary text without R code
##
##
|
|............ | 17%
## label: unnamed-chunk-4-5
##
|
|.............. | 20%
## ordinary text without R code
##
##
|
|................ | 23%
## label: unnamed-chunk-6
##
|
|.................. | 26%
## ordinary text without R code
##
##
|
|.................... | 29%
## label: unnamed-chunk-7
##
|
|...................... | 31%
## ordinary text without R code
##
##
|
|........................ | 34%
## label: unnamed-chunk-8
##
|
|.......................... | 37%
## ordinary text without R code
##
##
|
|............................ | 40%
## label: unnamed-chunk-9
##
|
|.............................. | 43%
## ordinary text without R code
##
##
|
|................................ | 46%
## label: unnamed-chunk-10
##
|
|.................................. | 49%
## ordinary text without R code
##
##
|
|.................................... | 51%
## label: unnamed-chunk-11
##
|
|...................................... | 54%
## ordinary text without R code
##
##
|
|........................................ | 57%
## label: unnamed-chunk-12
##
|
|.......................................... | 60%
## ordinary text without R code
##
##
|
|............................................ | 63%
## label: unnamed-chunk-13
##
|
|.............................................. | 66%
## ordinary text without R code
##
##
|
|................................................ | 69%
## label: unnamed-chunk-14
##
|
|.................................................. | 71%
## ordinary text without R code
##
##
|
|.................................................... | 74%
## label: unnamed-chunk-15
##
|
|...................................................... | 77%
## ordinary text without R code
##
##
|
|........................................................ | 80%
## label: unnamed-chunk-16
##
|
|.......................................................... | 83%
## ordinary text without R code
##
##
|
|............................................................ | 86%
## label: unnamed-chunk-17
##
|
|.............................................................. | 89%
## ordinary text without R code
##
##
|
|................................................................ | 91%
## label: unnamed-chunk-18
##
|
|.................................................................. | 94%
## ordinary text without R code
##
##
|
|.................................................................... | 97%
## label: unnamed-chunk-19
##
|
|......................................................................| 100%
## ordinary text without R code
## output file: BCB420_A2S_DimitrijeRatkov.knit.md
## "C:/Program Files/RStudio/bin/pandoc/pandoc" +RTS -K512m -RTS BCB420_A2S_DimitrijeRatkov.utf8.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash+smart --output BCB420_A2S_DimitrijeRatkov.html --email-obfuscation none --self-contained --standalone --section-divs --template "D:\Documents\R\win-library\3.6\rmarkdown\rmd\h\default.html" --no-highlight --variable highlightjs=1 --variable "theme:bootstrap" --include-in-header "C:\Users\super\AppData\Local\Temp\RtmpkV6hfP\rmarkdown-str804c59eb12b.html" --mathjax --variable "mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" --lua-filter "D:/Documents/R/win-library/3.6/rmarkdown/rmd/lua/pagebreak.lua" --lua-filter "D:/Documents/R/win-library/3.6/rmarkdown/rmd/lua/latex-div.lua" --filter "C:/Program Files/RStudio/bin/pandoc/pandoc-citeproc.exe"
##
## Output created: BCB420_A2S_DimitrijeRatkov.html
We introduce the work done in A1 and A2. We render A1 and A2 to ensure all previous analyses are loaded.
As part of the data processing done in Assignment 1, data was refined from the GSE141220 accession in the GEO database. This accession represents a study done in a Burkitt Lymphoma cell line positive for Epstein-Barr Virus (Frey et al.). Viral reactivation from latency was induced using sodium butyrate = NaB, and Illumina HiSeq 2500 reads were compared to uninduced cells at three time points: three hours post-induction, 24 hours post-induction, and 48 post-induction. The paper aimed to identify cellular factors upstream of viral reactivation; thus the three-hour time point was included, as viral factors are not being expressed yet.
Data was filtered to exclude low-value genes. Data was normalized using the TMM method. Ensembl IDs were mapped to HUGO gene symbols. Mappings to whitespace characters were identified for later consideration. Two sets of duplicate entries were identified, corresponding to lncRNAs, and collapsed into the identifier corresponding to the longer variant. The data was separated into two tables: one of which contained counts with successful HUGO mappings, and another which contained counts identifiable only by ensemblIDs.
Assignment 1
First, of course, we install necessary modules.
Before anything else, I took time to thoroughly read the paper. I also started writing up the data interpretation section, as the first few points could be answered right after reading the paper.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GEOquery")
## Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
## Installing package(s) 'GEOquery'
## package 'GEOquery' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP\downloaded_packages
## Installation path not writeable, unable to update packages: boot, class,
## deldir, KernSmooth, lattice, MASS, Matrix, mgcv, nlme, nnet, survival
BiocManager::install("biomaRt")
## Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
## Installing package(s) 'biomaRt'
## Warning: package 'biomaRt' is in use and will not be installed
## Installation path not writeable, unable to update packages: boot, class,
## deldir, KernSmooth, lattice, MASS, Matrix, mgcv, nlme, nnet, survival
BiocManager::install("edgeR")
## Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
## Installing package(s) 'edgeR'
## Warning: package 'edgeR' is in use and will not be installed
## Installation path not writeable, unable to update packages: boot, class,
## deldir, KernSmooth, lattice, MASS, Matrix, mgcv, nlme, nnet, survival
library(edgeR)
library(biomaRt)
A promising dataset was found very quickly, from a study investigating downstream cellular effectors of the lytic reactivation process of Epstein Barr virus (EBV). As I have a strong focus on virology, and I wanted to apply that background here, finding an appropriate dataset was as easy as searching for a topic I was already aware of. I had already worked with EBV during my second-year research. The latent-lytic reactivation pathway, dealing with how this latent virus starts an active infection anew (renewing the viral replication process) when conditions are right, is a big topic in EBV research, and bioinformatics approaches are well-suited to address it.
This group chose to examine gene expression in stimulated and non-stimulated cells across three timepoints - 3h, 24h, and 48h - corresponding to a time before viral genes are expressed but a reactivation phenotype is evident, a time when the virus would be in the lytic mode transcribing early-mid proteins, and a time when cell functions are being shut down to maximize viral production. The HH514-16 cell line was used, representing a lymphocyte lineage from a case of EBV-induced Burkitt’s lymphoma. Cells were reactivated using a sodium butyrate signal.
The experiment was done using the Bru-seq protocol; pulling down RNAs using a bromouridine label, constructing a cDNA library, and sequencing on an Illumina HiSeq 2500.
More information on this paper can be found on the corresponding articles on my student wiki.
The initial coverage of the dataset was decent, spanning over 57000 genes across all chromosomes, and also describing viral genes (though we will not be ad). Specifics of coverage will be addressed when the data has been analyzed more thoroughly. The research is new, and the dataset was updated as recently as Jan 15th. Two replicates of the controls and experimental samples were used, which are not as many replicates as would be ideal, but should suffice.
In this case, the raw data is stored in the supplemental documents of the sample entries. It is also accessible through the main series’ supplemental docs, but going through the samples is simpler since we’re only dealing with one supp. doc. at a time, instead of unzipping and going file-by-file. As such we will access it there.
# Save series metadata
series <- unlist(GEOquery::getGEO("GSE141220"))
## Found 1 file(s)
## GSE141220_series_matrix.txt.gz
## Using locally cached version: C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSE141220_series_matrix.txt.gz
## Parsed with column specification:
## cols(
## ID_REF = col_character(),
## GSM4198479 = col_character(),
## GSM4198481 = col_character(),
## GSM4198483 = col_character(),
## GSM4198484 = col_character(),
## GSM4198486 = col_character(),
## GSM4198488 = col_character(),
## GSM4198490 = col_character(),
## GSM4198492 = col_character(),
## GSM4198494 = col_character(),
## GSM4198495 = col_character(),
## GSM4198497 = col_character(),
## GSM4198499 = col_character()
## )
## Using locally cached version of GPL16791 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GPL16791.soft
# Extract sample and file names for automated data processing
sampleFileNames <- unlist(strsplit(series$GSE141220_series_matrix.txt.gz@experimentData@other$sample_id, split="\\s"))
sampleNames <- as.vector(series$GSE141220_series_matrix.txt.gz@phenoData@data$title)
# Initialize an empty list-of-lists to collect the sample data into, once all of it has been gathered
rawCounts <- vector("list", 12)
for (s in sampleFileNames)
{
# We download if the sample file in the corresponding sample subdir doesn't exist.
if (!file.exists(paste(s, "/", "sample.rds", sep = "")))
{
sfile <- GEOquery::getGEOSuppFiles(s, makeDirectory = TRUE)
fname <- rownames(sfile)
b <- read.delim(fname[1], header = TRUE)
rawCounts[[match(s, sampleFileNames)]] <- b
saveRDS(b, file = paste(s, "/", "sample.rds", sep = ""))
}
# Otherwise we load the saved file.
else
{
b <- readRDS(file = paste(s, "/", "sample.rds", sep = ""))
rawCounts[[match(s, sampleFileNames)]] <- readRDS(file = paste(s, "/", "sample.rds", sep = ""))
}
}
names(rawCounts) <- sampleNames
We’ve now loaded each sample’s raw data into of data frames, which are stored in a central list. We will parse this into a ain data
# Start the central data frame, which will gather the raw counts of all the samples; match to genes.
counts <- data.frame(ensemblID = rawCounts[[1]]$ensembl_gene, gname = rawCounts[[1]]$name, eff_len = rawCounts[[1]]$effective_length)
# To be able to iteratively add to this data frame, we initialize it with a single column of the appropriate length.
reads <- data.frame(seed = rep(NA, times = 57733))
for(s in rawCounts)
{
newCol <- data.frame(ree = s$read_count)
reads <- cbind(reads, newCol)
}
# After iterating, we delete this column.
reads$seed <- NULL
# Assign sample names to their corresponding columns and continue.
colnames(reads) <- sampleNames
counts <- cbind(counts, reads)
We are about to filter out low-value genes. Let’s check how much of our data will be pruned.
# Save current size of data set
oldsize <- length(counts$ensemblID)
cpms = edgeR::cpm(counts[,4:15])
rownames(cpms) <- counts[,1]
# We have two replicates for each condition.
keep = rowSums(cpms >1) >=2
counts_filtered <- counts[keep,]
# Now check the new size.
newsize <- length(counts_filtered$ensemblID)
summarized_counts <- sort(table(counts_filtered$ensemblID),decreasing = TRUE)
print("Initial genes in consideration:")
## [1] "Initial genes in consideration:"
print(oldsize)
## [1] 57733
print("Genes after filtering:")
## [1] "Genes after filtering:"
print(newsize)
## [1] 20503
By calling the above size variables, we can see that we’ve reduced the number of genes under consideration from 57733 to 20503. By calling the summarized counts table, we see that there are no duplicate counts.
Now we must normalize the data. Normalization was performed using the TMM method, as the data was in the form of an RNAseq.
Below, normalization of the data, and visualizations before and after that process, can be seen. The “outliers” at negative infinity, throwing warning messages below, are not factually outliers - rather the log2 of 0 is recognized as negative infinity in R.
# To visually examine our data before normalization, we make some box plots.
dataBefore <- log2(cpm(counts_filtered[,4:15]))
boxplot(dataBefore, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "EBV Reac. RNAseq")
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 2 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 3 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 4 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 5 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 6 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 7 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 8 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 9 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 10 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 11 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 12 is not drawn
abline(h = median(apply(dataBefore, 2, median)), col = "red", lwd = 0.6, lty = "dashed")
filtered_data_matrix <- as.matrix(counts_filtered[,4:15])
rownames(filtered_data_matrix) <- counts_filtered$ensemblID
# The samples are grouped replicate 1,2 for each of six conditions.
samples = rep(1:6, each=2)
d = edgeR::DGEList(counts=filtered_data_matrix, group=samples)
d = calcNormFactors(d)
normalized <- cpm(d)
counts_normalized <- cbind(data.frame(counts_filtered[,1:3]), as.data.frame(normalized))
# ...and another one after normalization.
dataAfter <- log2(cpm(counts_normalized[,4:15]))
boxplot(dataAfter, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "EBV Reac. RNAseq Nrmlzd")
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 2 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 3 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 4 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 5 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 6 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 7 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 8 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 9 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 10 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 11 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 12 is not drawn
abline(h = median(apply(dataAfter, 2, median)), col = "green", lwd = 0.6, lty = "dashed")
There is no clear difference between the plots. As the values in the filtered vs. neutralized counts have changed, we assume that the normalization was successful and the data was high-quality to begin with.
Now, we’re finally ready to map to symbols. We will pull from biomaRt, and attempt to maximize the coverage of our dataset using the proper HUGO symbols.
# Save on computation and bandwidth by saving / reading from a file when possible.
conversion_stash <- "ebv_id_conversion.rds"
ensembl <- useEnsembl("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", mirror = "useast")
if(file.exists(conversion_stash)) {
HUGOs <- readRDS(conversion_stash)
} else {
HUGOs <- getBM(attributes = c("ensembl_gene_id_version","hgnc_symbol"), filters = c("ensembl_gene_id_version"), values = counts_normalized$ensemblID, ensembl, useCache = FALSE)
saveRDS(HUGOs, conversion_stash)
}
We’re not out of the woods yet. There are many blank symbols in our HUGOs frame to begin with: we can see this by ranking the occurence of IDs.
summarized_hgnc_counts <- sort(table(HUGOs$hgnc_symbol),decreasing = TRUE)
print(summarized_hgnc_counts[1:10])
##
## ITFG2-AS1 POLR2J4 A2M A2ML1-AS1 A2MP1
## 2889 2 2 1 1 1
## A3GALT2 AACSP1 AADACL2-AS1 AADACP1
## 1 1 1 1
We see 2889 occurences of a blank space (this is a blank space, not a whitespace character). After this, ITFG2-AS1 and POLR2J4 appear with two occurences in the frame. Lastly come the typical unique symbols.
We remove all rows with blank symbols. Later we will collapse the duplicates, which were confirmed to map to the same gene: the two ITFG2-AS1 entries correspond to two variants of the ITFG2 antisense RNA, and the two POLR2J4 entries correspond to an untranscribed LncRNA product. The counts for the shorter variants will be collapsed into the larger ones.
HUGOs <- HUGOs[HUGOs$hgnc_symbol != "", 1:2]
After much effort, a merging of the pulled Mart attribute frame with the normalized counts was successful.
colnames(HUGOs) <- c("ensemblID", "hgnc_symbol")
counts_normal_annot <- merge.data.frame(HUGOs, counts_normalized, by.x = 1, all.y=TRUE)
It’s not clear why so many blank symbols appeared in the biomaRt pull. There may be a temporary issue with the backend servers. There may be an issue with the initial annotation of this data set. In any case, we move on. Now we collapse:
for (i in c(4:15))
{
#collapse ITFG2-AS1 reads
counts_normal_annot[counts_normal_annot$ensemblID == "ENSG00000256150.2", i] = counts_normal_annot[counts_normal_annot$ensemblID == "ENSG00000256150.2", i] + counts_normal_annot[counts_normal_annot$ensemblID == "ENSG00000258325.2", i]
# Collapse POLR2J4 reads
counts_normal_annot[counts_normal_annot$ensemblID == "ENSG00000214783.9", i] = counts_normal_annot[counts_normal_annot$ensemblID == "ENSG00000214783.9", i] + counts_normal_annot[counts_normal_annot$ensemblID == "ENSG00000272655.2", i]
}
# Remove the redundant entry.
counts_normal_annot <- counts_normal_annot[(counts_normal_annot$ensemblID != "ENSG00000258325.2" & counts_normal_annot$ensemblID != "ENSG00000272655.2"),]
Finally, we will create a dataframe containing every protein with a known HGNC symbol, with that symbol acting as the rowname. Where possible, we set down the HUGO symbols as row-names for this frame, and the ensemblIDs for the other.
# If we find an hgnc symbol, we split into this DF.
counts_final_HUGO <- counts_normal_annot[!is.na(counts_normal_annot$hgnc_symbol),]
rownames(counts_final_HUGO) <- counts_final_HUGO$hgnc_symbol
saveRDS(counts_final_HUGO, "HGNC_Labels.rds")
# If not, we split into this one. We exclude the useless hgnc_symbol column from this.
counts_final_old <- counts_normal_annot[is.na(counts_normal_annot$hgnc_symbol),-c(2)]
rownames(counts_final_old) <- counts_final_old$ensemblID
saveRDS(counts_final_old, "Ensembl_Labels.rds")
For later analysis, if we need a non-ensemblID unique identifier, we can construct and assign unique names based on the gname column of the old mapping. For now, uniqueness is maintained by assigning ensemblIDs as row names.
** What are the control and test conditions of the dataset? **
The experiment was done in the HH514-16 cell line, which are lymphocytes from a case of EBV-induced Burkitt’s lymphoma. The control conditions are untreated cells of this cell line. The test condition is an application of the lytic signal sodium butyrate (NaB), which induces cells latently infected with EBV to undergo lytic reactivation. The treatment conditions were applied for three different durations - 3h, 24h and 48h - across two replicates each.
** Why is the dataset of interest to you? **
The dataset is of interest because this system is familiar to me, and because it’s addressing a question I’ve explored in the past, helping us achieve a comprehensive understanding. Furthermore, there’s a practical application for this information. Epstein-Barr is a very common infection - by most estimates over 90% of adults worldwide are infected - that wreaks havoc on immunosuppressed patients.
Notably, organ transplant recipients are prone to transplant failure and secondary infections if they are given an organ from a donor who has latent EBV - these patients are put on immunosuppressive therapies to protect the transplant from immune attack, which means that reactivated virus has an easier time spreading, which in turn reactivates the immunity and causes further damage to host and transplant tissues. Understanding the lytic reactivation process is the first step in creating treatments to suppress viral spread and pathogenesis in these patients. In healthy patients, EBV transforms B cells, causing lymphoma; the latent phase is responsible for this, but lytic reactivation allows for the latently-infected population to expand, increasing the chances of that event. Lytic reactivation is also necessary for the spread of the virus between hosts.
In patients, as in the lab, not every cell is infected simultaneously. Rather, a subpopulation of cells is susceptible to reactivation at any given time. The BZLF1 viral gene, and its ZEBRA protein product, are well-known to mediate reactivation; but a host of cellular proteins have been identified as interactors with this protein. In fact, reactivation is impossible without the presence of certain pro-lytic genes, which must be actively expressed in the cell for this to happen. This suggests a pro-lytic state exists, which facilitates the function of viral proteins and allows for the program of viral replication to restart. This state has not been characterized in great detail, nor is its temporal evolution apparent yet. This study aims to resolve those problems from a bioinformatics perspective.
** Were there expression values that were not unique for specific genes? How did you handle these? **
The most common instance of this were genes which had zero reads across all samples. These were filtered out using the CPM method. Otherwise, no attempt was made to handle such a situation. Theoretically the wrong reads could have been duplicated across gene entries, but this was deemed too difficult to address and likely to be a nonissue.
** Were there expression values that could not be mapped to current HUGO symbols? **
There were many. Out of the initial biomaRt pull, almost 3000 ensemblIDs lacked a HUGO symbol. Furthermore, many ensembleIDs weren’t represented in the biomaRt pull at all. Such missing values were mapped to ensemblIDs instead, for the time being. I will follow up regarding the best avenue to deal with this in the future.
** How many outliers were removed? **
None. Without better understanding of how genes are upregulated or downregulated over the course of EBV infection, calling a result an “outlier” has no certain basis in fact. Manually pruning suspicious results would represent a type of investigator bias; additionally, there are outlier-proofed data analysis methods (for instance edgeR apparently has some good ones) which look at the problem from the statistical side, without the need for value judgements.
While these do not necessarily qualify as outliers, 37230 small-impact genes were initially removed from consideration.
** How did you handle replicates? **
Experimental replicates were represented using individual columns in the data frame.
If duplicate symbol entries are meant by “replicates”, these were dealt with by collapsing into the less spliced / larger / more representative gene product. In my case, these were untranslated RNAs, so this is likely an adequate method. A possible loss of resolution for these two proteins was deemed an acceptable loss for maintaining the cardinality of the data.
** What is the final coverage of your dataset? **
The final coverage of the HUGO-mapped dataset is 4675 genes; an additional 15827 are included if we consider the old mapping. All things considered this seems like a very small coverage for an RNAseq experiment, but not certainly useless; also, I don’t know what type of numbers are expected after the workflow has been processed and whatnot.
Before the analyses to be done in Assignment 2, all Assignment 1 requirements were fulfilled. This included a visualization of the counts as an HTML table, and as MDS plots. MDS plotting revealed that HUGO results grouped within timepoints, while eID results were more strangely distributed, with replicates showing greater similarity than timepoint groups. This was remarked upon, but the HUGO results were used in all analyses, and the eID anomaly was not relevant further.
Differential gene expression analysis was performed. First, a row-normalized heatmap was applied. Timepoint- and replicate-specific transcription programs were observed, again. The data was fit to two separate models, one considering replicate differences, and another ignoring these. This approach found more significant differences between groups when differences between replicates were accounted for. So, the data was considered with this distinction in mind.
The quasi-likelihood test was used to analyze differential expression, identify false discoveries, and correct data accordingly by excluding them. These genes were again visualized by heatmap, and also by a volcano plot. To better contextualize the results of these analyses, the thresholded analysis was performed.
In this way, we observed an upregulation of actin remodeling pathways has been previously noted in the later stages of EBV infection, along with cytoskeletal binding proteins (Price et al.). We identified RNA synthesis, export, and metabolism as downregulated areas; previous research has demonstrated that these activities are downregulated, with EBV proteins taking over crucial roles in transcription and RNA processing for viral RNAs. (Park et al.). Some of our findings corroborated the original source paper’s analysis (Frey et al.), which aimed mostly to identify associations at the three-hour time point. They identified that upregulated genes were involved in amoebiasis and cell-cell adhesion, highlighting the importance of actin remodeling; and that downregulated genes were involved in RNA transport and metabolism.
Assignment 2
Load modules to be used in this notebook. Render Assignment 1.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GEOquery")
## Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
## Installing package(s) 'GEOquery'
## package 'GEOquery' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP\downloaded_packages
## Installation path not writeable, unable to update packages: boot, class,
## deldir, KernSmooth, lattice, MASS, Matrix, mgcv, nlme, nnet, survival
BiocManager::install("knitr")
## Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
## Installing package(s) 'knitr'
## Warning: package 'knitr' is in use and will not be installed
## Installation path not writeable, unable to update packages: boot, class,
## deldir, KernSmooth, lattice, MASS, Matrix, mgcv, nlme, nnet, survival
BiocManager::install("ComplexHeatmap")
## Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
## Installing package(s) 'ComplexHeatmap'
## Warning: package 'ComplexHeatmap' is in use and will not be installed
## Installation path not writeable, unable to update packages: boot, class,
## deldir, KernSmooth, lattice, MASS, Matrix, mgcv, nlme, nnet, survival
BiocManager::install("circlize")
## Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
## Installing package(s) 'circlize'
## Warning: package 'circlize' is in use and will not be installed
## Installation path not writeable, unable to update packages: boot, class,
## deldir, KernSmooth, lattice, MASS, Matrix, mgcv, nlme, nnet, survival
BiocManager::install("edgeR")
## Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
## Installing package(s) 'edgeR'
## Warning: package 'edgeR' is in use and will not be installed
## Installation path not writeable, unable to update packages: boot, class,
## deldir, KernSmooth, lattice, MASS, Matrix, mgcv, nlme, nnet, survival
BiocManager::install("EnhancedVolcano")
## Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
## Installing package(s) 'EnhancedVolcano'
## Warning: package 'EnhancedVolcano' is in use and will not be installed
## Installation path not writeable, unable to update packages: boot, class,
## deldir, KernSmooth, lattice, MASS, Matrix, mgcv, nlme, nnet, survival
library(knitr)
library(ComplexHeatmap)
library(circlize)
library(edgeR)
library(EnhancedVolcano)
INTRO
Data was refined from the GSE141220 accession in the GEO database. This accession represents a study done in a Burkitt Lymphoma cell line positive for Epstein-Barr Virus. Viral reactivation from latency was induced using sodium butyrate = NaB, and Illumina HiSeq 2500 reads were compared to uninduced cells at three time points: three hours post-induction, 24 hours post-induction, and 48 post-induction. The paper aimed to identify cellular factors upstream of viral reactivation; thus the three-hour time point was included, as viral factors are not being expressed yet
Data was filtered to exclude low-value genes. Data was normalized using the TMM method. Ensembl IDs were mapped to HUGO gene symbols. Mappings to whitespace characters were identified for later consideration. Two sets of duplicate entries were identified, corresponding to lncRNAs, and collapsed into the identifier corresponding to the longer variant. The data was separated into two tables: one of which contained counts with successful HUGO mappings, and another which contained counts identifiable only by ensemblIDs.
Loose Ends and Selection of Dataset for Analysis
In the submitted version of Assignment 1, the final data was not shown, it was not saved to an output file, and MDS plots were not generated. Two saveRDS() calls were added to the A1 file. We load data from the saved file, output sample HTML tables of our datasets (Table 1, 2), and generate MDS plots.
# Read tables from saved files.
counts_HUGO <- readRDS("HGNC_Labels.rds")
counts_eID <- readRDS("Ensembl_Labels.rds")
samples = rep(1:6, each=2)
# A sample of the full tables is visualized.
kable(counts_HUGO[1:10,1:16], caption = "Table 1: Counts with HUGO Mappings. Counts which could successfully be mapped to HUGO identifiers.", format = "pandoc")
| ensemblID | hgnc_symbol | gname | eff_len | UT_3h_R1_rep1 | UT_3h_R1_rep2 | UT_24h_R1_rep1 | UT_24h_R1_rep2 | UT_48h_R1_rep1 | UT_48h_R1_rep2 | NaB_3h_R1_rep1 | NaB_3h_R1_rep2 | NaB_24h_R1_rep1 | NaB_24h_R1_rep2 | NaB_48h_R1_rep1 | NaB_48h_R1_rep2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DPM1 | ENSG00000000419.12 | DPM1 | DPM1 | 23689 | 166.4999076 | 136.5644319 | 152.0941940 | 120.0384827 | 126.0321576 | 123.4780792 | 350.2342469 | 262.7200958 | 242.312215 | 226.601994 | 258.595007 | 215.4885871 |
| NFYA | ENSG00000001167.14 | NFYA | NFYA | 27032 | 93.7111597 | 99.9934116 | 97.0354183 | 97.9519079 | 71.1942572 | 88.8201858 | 50.8325378 | 75.1446043 | 75.398509 | 78.529835 | 65.569392 | 74.0370496 |
| LAS1L | ENSG00000001497.16 | LAS1L | LAS1L | 22175 | 35.2846979 | 39.6870183 | 30.6863415 | 55.0837421 | 27.9657089 | 36.2200600 | 27.5659653 | 39.4962100 | 10.274242 | 16.331593 | 16.974062 | 20.4177062 |
| KRIT1 | ENSG00000001631.15 | KRIT1 | KRIT1 | 40767 | 147.0167466 | 143.8223917 | 133.1496516 | 118.1092635 | 144.7168314 | 137.4817461 | 96.8981716 | 101.9175032 | 96.651079 | 93.595659 | 82.590628 | 116.5742579 |
| MYH16 | ENSG00000002079.14 | MYH16 | MYH16 | 72302 | 0.0762274 | 0.0000000 | 0.0661815 | 0.0000000 | 0.0886866 | 0.0548567 | 0.1000208 | 0.0000000 | 1.192262 | 1.190479 | 2.330430 | 0.5414071 |
| BAD | ENSG00000002330.13 | BAD | BAD | 14840 | 6.5052292 | 6.3578352 | 8.5716799 | 6.7431556 | 8.2232841 | 4.8268957 | 13.2614206 | 9.9166836 | 7.308203 | 5.641317 | 6.913772 | 4.3635124 |
| LAP3 | ENSG00000002549.12 | LAP3 | LAP3 | 30777 | 129.7590612 | 128.2643821 | 109.7387379 | 113.9909599 | 81.2181883 | 81.5280852 | 137.2160642 | 129.4843750 | 52.393941 | 56.141705 | 224.043202 | 239.0082867 |
| MAD1L1 | ENSG00000002822.15 | MAD1L1 | MAD1L1 | 417376 | 280.6957379 | 205.1033590 | 220.6897378 | 149.8288722 | 156.2248952 | 112.3910849 | 332.7430994 | 254.0615428 | 192.882441 | 145.881337 | 183.787057 | 150.5959313 |
| SNX11 | ENSG00000002919.14 | SNX11 | SNX11 | 15526 | 19.0327786 | 11.8702846 | 14.5459191 | 12.1524956 | 17.5967377 | 11.5936481 | 21.7927397 | 15.2386963 | 25.700183 | 23.569422 | 29.636421 | 15.8947326 |
| KLHL13 | ENSG00000003096.14 | KLHL13 | KLHL13 | 219426 | 0.1524420 | 0.2858658 | 0.3571202 | 0.2809505 | 0.4611361 | 0.2467937 | 0.0199965 | 0.7224917 | 19.230850 | 11.750682 | 16.663066 | 10.0008334 |
kable(counts_eID[1:10,1:15], caption = "Table 2: Counts without HUGO Mappings. Counts which couldn't be successfully mapped to HUGO identifiers - the reason for this is not clear.", format = "pandoc")
| ensemblID | gname | eff_len | UT_3h_R1_rep1 | UT_3h_R1_rep2 | UT_24h_R1_rep1 | UT_24h_R1_rep2 | UT_48h_R1_rep1 | UT_48h_R1_rep2 | NaB_3h_R1_rep1 | NaB_3h_R1_rep2 | NaB_24h_R1_rep1 | NaB_24h_R1_rep2 | NaB_48h_R1_rep1 | NaB_48h_R1_rep2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000000457.13 | ENSG00000000457.13 | SCYL3 | 44329 | 42.9278627 | 45.6876732 | 52.6688643 | 59.225118 | 61.2207244 | 78.9057306 | 38.624215 | 38.1869092 | 30.8445082 | 35.774313 | 28.082649 | 35.2332085 |
| ENSG00000000460.16 | ENSG00000000460.16 | C1orf112 | 59351 | 159.1157258 | 152.7146716 | 149.1150522 | 117.474203 | 130.0031150 | 137.2115071 | 160.050011 | 163.6677099 | 72.3092412 | 23.639997 | 30.296596 | 18.3746044 |
| ENSG00000000938.12 | ENSG00000000938.12 | FGR | 23214 | 24.2381004 | 16.3266577 | 26.7241277 | 27.044951 | 19.8477252 | 22.4425371 | 9.820986 | 7.6242677 | 6.2262730 | 19.447620 | 5.127088 | 8.5364753 |
| ENSG00000000971.15 | ENSG00000000971.15 | CFH | 95627 | 0.0635106 | 0.0000000 | 0.0487004 | 0.000000 | 0.0000000 | 0.0548397 | 0.080012 | 0.0249252 | 5.8951579 | 3.373355 | 13.089689 | 6.4019734 |
| ENSG00000001084.10 | ENSG00000001084.10 | GCLC | 47789 | 109.3436432 | 107.8744336 | 102.1004006 | 100.663562 | 100.9240361 | 102.3608246 | 43.544801 | 45.0490574 | 77.4537934 | 91.002192 | 110.817185 | 118.7455461 |
| ENSG00000001460.17 | ENSG00000001460.17 | STPG1 | 58099 | 10.8757283 | 9.2911453 | 9.1434721 | 9.763154 | 15.8210450 | 17.9087964 | 19.861903 | 19.5086970 | 28.7530026 | 21.601027 | 15.264669 | 15.2569204 |
| ENSG00000001461.16 | ENSG00000001461.16 | NIPAL3 | 57183 | 9.7195377 | 13.0078435 | 13.9328602 | 11.903561 | 33.2210948 | 34.9019336 | 36.303816 | 44.3262644 | 50.1858858 | 50.772778 | 39.308057 | 56.6713932 |
| ENSG00000001561.6 | ENSG00000001561.6 | ENPP4 | 16707 | 14.8678149 | 20.3087577 | 17.1762349 | 19.703950 | 17.9048370 | 22.5320409 | 18.082072 | 18.8121384 | 9.7233123 | 12.161753 | 9.166674 | 8.3773183 |
| ENSG00000001617.11 | ENSG00000001617.11 | SEMA3F | 33891 | 0.0254125 | 0.0000000 | 0.0162299 | 0.000000 | 0.1064255 | 0.0274198 | 0.080012 | 0.0498195 | 0.8831607 | 1.162133 | 1.631398 | 0.6203103 |
| ENSG00000001626.14 | ENSG00000001626.14 | CFTR | 188510 | 0.0635189 | 0.0857486 | 0.0324614 | 0.000000 | 0.0531955 | 0.0000000 | 0.000000 | 0.0249097 | 17.1554659 | 6.349629 | 11.030887 | 3.4579562 |
# Make MDS plots of successful HUGO mappings.
plotMDS(counts_HUGO[5:16], labels=colnames(counts_HUGO[5:16]), col=c("darkgreen", "blue", "red", "magenta", "yellow", "cyan")[factor(samples)])
1
plotMDS(counts_eID[4:15], labels=colnames(counts_eID[4:15]), col=c("darkgreen", "blue", "red", "magenta", "yellow", "cyan")[factor(samples)])
1
Upon generation of MDS plots for either data set, a curious pattern emerges. The HUGO dataset seems to group consistently within timepoints, and the relative distances are exactly as expected. The UnTreated samples gradually change from the 3h time point up to 48h, presumably due to transcriptional changes as the cultures are maintained longer, and some baseline level of viral reactivation. The NaB-treated sample at 3 hours clusters far from the other NaB-treated samples, due to the effect of NaB exposure, with no role yet played by viral proteins; later clusters are closer together because progressive viral replication is ongoing, and viral proteins are affecting the cell’s functions.
On the other hand, the eID dataset clusters oddly. The UnTreated samples roughly recapitulate the same pattern as for the HUGO label genes. However, clustering across timepoints is seen for the 24h and 48h NaB-treated samples. This was not due to erroneous table construction on our end, as GEO names were iterated over to construct the table on a per-column basis. The following code chunk uses the same loop as the table construction, but outputs sample names instead, to demonstrate that the error was not introduced at this point.
# Get sample GEO codes and sample filenames from series
series <- GEOquery::getGEO("GSE141220")
## Found 1 file(s)
## GSE141220_series_matrix.txt.gz
## Using locally cached version: C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSE141220_series_matrix.txt.gz
## Parsed with column specification:
## cols(
## ID_REF = col_character(),
## GSM4198479 = col_character(),
## GSM4198481 = col_character(),
## GSM4198483 = col_character(),
## GSM4198484 = col_character(),
## GSM4198486 = col_character(),
## GSM4198488 = col_character(),
## GSM4198490 = col_character(),
## GSM4198492 = col_character(),
## GSM4198494 = col_character(),
## GSM4198495 = col_character(),
## GSM4198497 = col_character(),
## GSM4198499 = col_character()
## )
## Using locally cached version of GPL16791 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GPL16791.soft
sampleFileNames <- unlist(strsplit(series$GSE141220_series_matrix.txt.gz@experimentData@other$sample_id, split="\\s"))
sampleNames <- as.vector(series$GSE141220_series_matrix.txt.gz@phenoData@data$title)
# Pull names from individual sample accessions
verifyNames <- vector("character", 12)
i <- 1
for (s in sampleFileNames)
{
sfile <- GEOquery::getGEO(s)
verifyNames[i] <- sfile@header$title
i <- i+1
}
## Using locally cached version of GSM4198479 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSM4198479.soft
## Using locally cached version of GSM4198481 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSM4198481.soft
## Using locally cached version of GSM4198483 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSM4198483.soft
## Using locally cached version of GSM4198484 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSM4198484.soft
## Using locally cached version of GSM4198486 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSM4198486.soft
## Using locally cached version of GSM4198488 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSM4198488.soft
## Using locally cached version of GSM4198490 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSM4198490.soft
## Using locally cached version of GSM4198492 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSM4198492.soft
## Using locally cached version of GSM4198494 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSM4198494.soft
## Using locally cached version of GSM4198495 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSM4198495.soft
## Using locally cached version of GSM4198497 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSM4198497.soft
## Using locally cached version of GSM4198499 found here:
## C:\Users\super\AppData\Local\Temp\RtmpkV6hfP/GSM4198499.soft
# Show that the names extracted from the series accession match the names extracted from the samples
comparisonTable <- data.frame(namesFromSeries = sampleNames, namesFromSamples = verifyNames)
rownames(comparisonTable) <- sampleFileNames
kable(comparisonTable, caption = "Table 3: A comparison of file names as extracted from the series metadata vs. as extracted from the sample accession directly.", format = "pandoc")
| namesFromSeries | namesFromSamples | |
|---|---|---|
| GSM4198479 | UT_3h_R1_rep1 | UT_3h_R1_rep1 |
| GSM4198481 | UT_3h_R1_rep2 | UT_3h_R1_rep2 |
| GSM4198483 | UT_24h_R1_rep1 | UT_24h_R1_rep1 |
| GSM4198484 | UT_24h_R1_rep2 | UT_24h_R1_rep2 |
| GSM4198486 | UT_48h_R1_rep1 | UT_48h_R1_rep1 |
| GSM4198488 | UT_48h_R1_rep2 | UT_48h_R1_rep2 |
| GSM4198490 | NaB_3h_R1_rep1 | NaB_3h_R1_rep1 |
| GSM4198492 | NaB_3h_R1_rep2 | NaB_3h_R1_rep2 |
| GSM4198494 | NaB_24h_R1_rep1 | NaB_24h_R1_rep1 |
| GSM4198495 | NaB_24h_R1_rep2 | NaB_24h_R1_rep2 |
| GSM4198497 | NaB_48h_R1_rep1 | NaB_48h_R1_rep1 |
| GSM4198499 | NaB_48h_R1_rep2 | NaB_48h_R1_rep2 |
This indicates that no labeling error was introduced during data construction (Table 3). Potentially, the data was incorrectly uploaded to the GEO database. Another possibility is that the second replicates of these points were somehow treated differently, and massive experimental bias was introduced.
In any case, as only the HUGO-labeled data set will be used in the enrichment, this drastic difference will not come into play (though the underlying quality of the data remains questionable, and it may explain any oddities between the samples which may come up during the course of this analysis). This highlights the importance of figuring out why the HUGO mapping was so inefficient, and how these gene sets are different - which will be done separately after assignment submission and can be found on the wiki.
Differential Gene Expression
We begin by making a row-normalized heatmap, to obtain a preliminary visualization of the difference between samples.
# Define initial heatmap matrix
heatmap_matrix <- counts_HUGO[,5:ncol(counts_HUGO)]
rownames(heatmap_matrix) <- counts_HUGO$hgnc_symbol
colnames(heatmap_matrix) <- colnames(counts_HUGO[,5:ncol(counts_HUGO)])
# Normalize the heatmap
heatmap_matrix <- t(scale(t(heatmap_matrix)))
if(min(heatmap_matrix) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix)), c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix), 0, max(heatmap_matrix)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix), show_row_dend = TRUE, show_column_dend = TRUE, col=heatmap_col, show_column_names = TRUE, show_row_names = FALSE, show_heatmap_legend = TRUE)
# Display the heatmap
current_heatmap
We see that there is a distinct transcription program occurring at later time points when the virus is induced, as would be expected. At the 3h time point we find that expression is more similar to the untreated samples than it is to later NaB samples. Interestingly, we see that replicates are clustered more closely than time points in certain cases - UT 3h / 24h, NaB 24h / 48h - as with our earlier observation regarding the MDS plot. Since the heatmap allows us to compare genewise expression at a glance, it is more informative in this regard than the MDS plot, with its abstracted coordinate method.
Next, we will model the data to identify the top genes which are differentially expressed. Treatment status and time point are expected to contribute to differential expression. Given the differences observed between replicates thus far, it was decided to include this factor in the analysis also. Two models will be compared at this point: one ignoring the difference between replicates, and another accounting for replicate identity. The edgeR package was used as it is intended for the processing of RNAseq data. Dispersion estimation was followed by a Quasi-Likelihood test given that this was bulk RNAseq data.
(To be clear, I am genuinely unsure whether this distinction should be made - I am not just imitating the lecture 5 notes. )
# Split samples according to different properties, for later grouping
modelDF <- data.frame(lapply(colnames(counts_HUGO)[5:16], FUN=function(x){
unlist(strsplit(x, split = "\\_"))[c(1,2,4)]}))
colnames(modelDF) <- colnames(counts_HUGO)[5:16]
rownames(modelDF) <- c("treatment","time","rep")
modelDF <- data.frame(t(modelDF))
# Give compact sample of table
modelDF[1:3,]
# Define two models
modelNoRep <- model.matrix(~ modelDF$treatment + modelDF$time)
modelRep <- model.matrix(~ modelDF$treatment + modelDF$time + modelDF$rep)
# Fit data to model not considering replicate identity
d1 = DGEList(counts=as.matrix(counts_HUGO[5:16]), group=modelDF$treatment)
d1 <- estimateDisp(d1, modelNoRep)
fit1 <- glmQLFit(d1, modelNoRep)
# Fit data to model considering replicate identity
d2 = DGEList(counts=as.matrix(counts_HUGO[5:16]), group=modelDF$treatment)
d2 <- estimateDisp(d2, modelRep)
fit2 <- glmQLFit(d2, modelRep)
# Get top hits according to each model
qlf.noRep <- glmQLFTest(fit1, coef="modelDF$treatmentUT")
qlf.rep <- glmQLFTest(fit2, coef="modelDF$treatmentUT")
hitsNoRep <- topTags(qlf.noRep, sort.by = "PValue", n=nrow(counts_HUGO))
hitsRep <- topTags(qlf.rep, sort.by = "PValue", n=nrow(counts_HUGO))
# Merge top hits of both models into one DF for plotting
hitsNoRep_PValues <- data.frame(
hngc = rownames(hitsNoRep$table),
noRep_PValue=hitsNoRep$table$PValue)
hitsRep_PValues <- data.frame(
hngc = rownames(hitsRep$table),
rep_PValue=hitsRep$table$PValue)
two_models_PValues <- merge(hitsNoRep_PValues,
hitsRep_PValues,
by.x=1,by.y=1)
# Plot without axes restriction
two_models_PValues$colour <- "black"
two_models_PValues$colour[two_models_PValues$noRep_PValue<0.05] <- "orange"
two_models_PValues$colour[two_models_PValues$rep_PValue<0.05] <- "blue"
two_models_PValues$colour[two_models_PValues$noRep_PValue<0.05 & two_models_PValues$rep_PValue<0.05] <- "red"
plot(two_models_PValues$noRep_PValue,
two_models_PValues$rep_PValue,
col = two_models_PValues$colour,
xlab = "No Replicate model p-values",
ylab ="Replicate model p-values",
main="Distinguishing Reps vs. Not")
legend("topleft",
legend = c("Sig. without replicates", "Sig. with replicates", "Sig. in both cases"),
col = c("orange", "blue", "red"),
pch = 1,
bty = "o",
horiz = F,
inset = c(0.05, 0.05))
# Plot zoomed in up to a p-value of 0.1
plot(two_models_PValues$noRep_PValue,
two_models_PValues$rep_PValue,
col = two_models_PValues$colour,
xlab = "No Replicate model p-values",
ylab ="Replicate model p-values",
main="Distinguishing Reps vs. Not, Zoomed",
xlim = c(0, 0.1),
ylim = c(0, 0.1))
legend("topleft",
legend = c("Sig. without replicates", "Sig. with replicates", "Sig. in both cases"),
col = c("orange", "blue", "red"),
pch = 1,
bty = "o",
horiz = F,
inset = c(0.05, 0.05))
This highly linear relationship means that there is not much difference between the models - and this was predctable given that the same calculation is being done with slightly different parameters. Note that a distinguishable proportion of genes falls under this line, while few fall above it. Those genes which are significant according to the no-replicate model, but not the replicate model, are so few as to be invisible until the graph is zoomed in. Meanwhile, zooming the graph shows us that there aren’t as many blue points as it seems at first glance.
By accounting for the differences between replicates, we find more significant genes. As a result of this comparison it was decided that the replicate model would be used. Let’s see how many genes were significantly differentially expressed, and how many survive correction. Thresholds of 0.05 will be used in both cases; the false discovery rate is calculated alongside the p value by the quasi-likelihood test.
Significant genes:
print(length(which(hitsRep$table$PValue < 0.05)))
## [1] 2980
Significant genes which pass correction:
print(length(which(hitsRep$table$PValue < 0.05 & hitsRep$table$FDR < 0.05)))
## [1] 2809
# Separate out significant genes which pass correction.
filteredHits <- hitsRep$table[which(hitsRep$table$PValue < 0.05 & hitsRep$table$FDR < 0.05),]
Let’s make a heatmap of only the genes which are represented in this cohort.
# Isolate represented genes
top_hits <- rownames(hitsRep$table)[hitsRep$table$PValue<0.05 & hitsRep$table$FDR<0.05]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
# Display heatmap clustered
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
)
current_heatmap
# Display heatmap with columns not clustered
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
)
current_heatmap
Finally, we visualize the differentially expressed genes using a volcano plot. The EnhancedVolcano package was used for this task.
EnhancedVolcano(hitsRep$table,
lab = rownames(hitsRep$table),
x = 'logFC',
y = 'PValue',
xlim = c(-5, 8))
From the volcano plot, we see that some of the highly-significant upregulated genes include TRDC, KCNQ5, and PTPN7, while some of the highly-significant downregulated genes include NEBL and RNF217. These proteins have a variety of diverse functions: TRDC is the T-cell delta constant region variant; KCNQ5 is a potassium channel expressed in parts of the brain and in skeletal muscle; and PTPN7 is a protein tyrosine phosphatase which promotes cell growth and differentiation. Meanwhile, NEBL is an actin-remodeling protein expressed in cardiac muscle, while RNF217 is a ubiquitin protein ligase known to be inactivated in certain cancers. We see hints of oncogenic potential in the upregulation of PTPN7 and downregulation of RNF217, but there are also random genes that should by rights be expressed only in highly-differentiated tissues. We reserve further judgement for the thresholded analysis.
Notice that the volcano plot suggests most of our genes are downregulated, while the heat map suggests they’re mostly upregulated. This is likely an error as explained in the following section.
Thresholded Over/Under-Representation Analysis
finalHits_Upreg <- filteredHits[which(filteredHits$logFC > 0), ]
finalHits_Downreg <- filteredHits[which(filteredHits$logFC < 0), ]
With a threshold of p < 0.05, we have this many upregulated genes:
print(length(finalHits_Upreg$PValue))
## [1] 747
With a threshold of p < 0.05, we have this many downregulated genes:
print(length(finalHits_Downreg$PValue))
## [1] 2062
Looking at our heat map, we see that there is most likely a mixup, as the heat map suggests many more genes are upregulated in the treatment group than downregulated. This could be a result of the NaB treatment being considered as the baseline, rather than the untreated group. As a kludgy fix for this we’ll just invert the signs on the log fold-change for now, and reassign the finalHits dataframes accordingly.
filteredHits$logFC <- -filteredHits$logFC
finalHits_Upreg <- filteredHits[which(filteredHits$logFC > 0), ]
finalHits_Downreg <- filteredHits[which(filteredHits$logFC < 0), ]
With a threshold of p < 0.05, we have this many upregulated genes:
print(length(finalHits_Upreg$PValue))
## [1] 2062
With a threshold of p < 0.05, we have this many downregulated genes:
print(length(finalHits_Downreg$PValue))
## [1] 747
To run an analysis on these lists, we must first export genesets.
write.table(rownames(filteredHits), file = "totalHits.txt", sep = "\n", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(rownames(finalHits_Upreg), file = "upregHits.txt", sep = "\n", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(rownames(finalHits_Downreg), file = "downregHits.txt", sep = "\n", row.names = FALSE, col.names = FALSE, quote = FALSE)
These gene sets will be analyzed with g:profiler, more specifically the g:GOSt module. This tool was selected because the annotation dataset I selected for the in-class homework assignment deals with viral insertion sites in the human genome, which is not relevant to EBV since it doesn’t do this. g:profiler is not only familiar, since it was used in class, but also useful - we have no idea about the function of the genes which are differentially expressed in our dataset.
Benjamini-Hochberg FDR was used to determine significance and false detections (Fig. 1). The thresholded gene sets were exported in text format, which was then used in a g:profiler analysis. The data sources used were the GO biological process, the Reactome pathway database, and WikiPathways (Fig. 2).
Figure 1: Advanced options used for g:profiler. Depicts web form before gene name entry.
Figure 2: Data sources used for g:profiler.
No drastic association was visible on the basis of the total DE data, uncategorized into up/downregulated lists. I suspected that these results were being dominated by the more differenetial expression category, and this instinct was confirmed when the other queries were examined. The strongest associations for the unsorted data lay with cellular organization (Fig. 3).
Figure 3: GO Biological Process gene sets returned using the whole DE gene list.
Analyzing the upregulated gene list - which constituted a greater proportion of the total gene data than the downregulated gene list did - it was clear that it contributed more to the results of the total unsorted query; this query is also dominated by cellular organization terms.
We also see a number of terms relating to neuronal organization (Fig. 4). The Reactome data shows an equivalent association with muscle funtion and neuronal pathways (Fig. 5), a puzzling correlation at first - but a common factor does exist here, that being actin remodeling and function. A more interesting association is with neuronal signaling; but as neurotransmitters are transported along axons by transport proteins, this could merely be another correlation caused by actin-associated genes.
Figure 4: GO BP gene sets returned using the upregulated DE gene list. Prominent sets include those involved in cell projection organization, cell adhesion, and morphogenesis. Many neuronal gene sets are also present.
Figure 5: Reactome terms returned using the upregulated DE gene list. Prominent results include neuronal and muscular pathways.
The downregulated gene list showed a number of associations much stronger than had been seen for the upregulated list, mostly involved in a variety of semi-related metabolic processes (Fig. 6). At first this was a bit daunting to interpret, but the Reactome data showed a very strong association with RNA metabolism, which we attempted to elucidate further - the terms was thus thresholded on the basis of size.
Figure 6: GO BP gene sets returned using the downregulated DE gene list. A number of nucleobase and aromatic compound processing pathways are featured.
Figure 7: Reactome terms returned using the downregulated DE gene list. By far the most prominent result is the RNA metabolism pathway, with the cell cycle presenting at a distant second
When this data was thresholded on the basis of dataset size, we see more precise (if weaker) associations come to the forefront. Instead of “cellular nitrogen compound metabolic process” we now see RNA metabolism and ribosome biogenesis taking the top spots (Fig. 9).
Figure 9: GO BP gene sets returned using the downregulated DE gene list, and thresholded for term size, with an imposed limit of 1000 elements. We now see that RNA is more specifically the compound whose synthesis is being regulated, as opposed to nucleobases or aromatic compounds generally.
Interpretation
We’ve already interpreted the results in various places throughout this assignment, but now we’ll take the opportunity to discuss ties to the literature and previous research.
The upregulation of actin remodeling pathways has been previously noted in the later stages of EBV infection, along with cytoskeletal binding proteins (Price et al.) - presumably including the transport proteins involved in neuronal signaling down the length of axons (Figs. 4,5). The same is true for the downregulated pathways. We identified RNA synthesis, export, and metabolism as downregulated areas; previous research has demonstrated that these activities are downregulated, with EBV proteins taking over crucial roles in transcription and RNA processing for viral RNAs. (Park et al.).
Our findings aren’t precisely the same as the source paper (Frey et al.), which aimed mostly to identify associations at the three-hour time point, before viral transcription started. It sought to identify factors that were upstream of viral transcription. However, some of our findings are corroborated, especially regarding the GSEA they performed. They identified that upregulated genes were involved in amoebiasis, cell-cell adhesion, and olfactory transduction (a type of neural signaling); and that downregulated genes were involved in RNA transport and metabolism, and ribosome biogenesis (the third result for our thresholded downregulation GSEA (Fig. 9)).
One interesting contrast with the source paper is that they noted three times more genes that were downregulated - our data had the opposite imbalance, with three times more upregulated genes after we manually adjusted our log fold-differences due to a supposed incorrect modelling. And yet, the aforementioned similarities with the literature. How this fact ties in with our similar results in other ways (and our general agreement with the literature) may be due to the lack of a HUGO mapping step on their part - again raising questions as to what the difference was between the genes which could be mapped to HUGO and those which couldn’t.
R Modules Used
The following R modules were used in this project: GEOquery, knitr, ComplexHeatmap, circlize, edgeR, EnhancedVolcano
References
For whatever reason the automatic Rmarkdown bibliography syntax isn’t working.
Price AM, Tourigny JP, Forte E, Salinas RE, Dave SS, Luftig MA. 2012. Analysis of Epstein-Barr Virus-Regulated Host Gene Expression Changes through Primary B-Cell Outgrowth Reveals Delayed Kinetics of Latent Membrane Protein 1-Mediated NF- B Activation. J Virol. doi:10.1128/jvi.01069-12.
Park R, Miller G. 2018. Epstein-Barr Virus-Induced Nodules on Viral Replication Compartments Contain RNA Processing Proteins and a Viral Long Noncoding RNA. J Virol. doi:10.1128/jvi.01254-18.
Frey TR, Brathwaite J, Li X, Burgula S, Akinyemi IA, Agarwal S, Burton EM, Ljungman M, McIntosh MT, Bhaduri-McIntosh S. 2020. Nascent transcriptomics reveal cellular pro-lytic factors upregulated upstream of the latency-to-lytic switch protein of Epstein-Barr virus. J Virol. doi:10.1128/jvi.01966-19.
First, we must generate an appropriate set for this approach. For non-thresholded analysis, we want all the genes ranked by their quasi-likelihood p-value.
Recall that in A2 we found that our quasi-likelihood test produced opposite signs from what was expected. In creating our non-thresholded set, we’ll invert the signs, like we did in A2, as a quick fix.
hitsRep$table[,"rank"] <- log(hitsRep$table$PValue, base=10) * sign(hitsRep$table$logFC)
finalHits_NT <- data.frame(hitsRep$table)
finalHits_NT <- finalHits_NT[order(finalHits_NT$rank, decreasing = TRUE),]
printHits <- data.frame(GeneName <- rownames(finalHits_NT), rank <- finalHits_NT$rank, stringsAsFactors = TRUE)
colnames(printHits) <- c("GeneName", "rank")
printHits
Now we’ll save the ranked list, and analyze using the standalone Java program.
write.table(printHits, file = "NTHits.txt", sep = "\t", eol = "\n", row.names = FALSE, quote = FALSE)
The resulting file was used in GSEA. As for Assignment 2, the genesets used were Reactome, WikiPathways, and GO biological process. The April 1st 2020 Bader lab genesets were used; for GOBP, the all pathways / no IEA variant was used. The default number of permutations was used: 1000. The dataset was not collapsed or remapped in any way. A weighted enrichment statistic was applied.
For each gene set, 15-1000 and 15-200 size limits were tested; in all cases, the 15-200 gene set size limits were more informative, so these results are displayed below.
Figure 1: GOBP non-thresholded GSEA overview. Many gene sets were significantly enriched in the virus-infected cells, even for a p-value of <1. Note that there are more significantly-downregulated gene sets, even though less than half as many gene sets are downregulated in total.
Figure 2: Top six GOBP gene sets upregulated in virus-infected cells. Note the actin-focused gene sets upregulated here, such as genes involved in taxis and synapse organization. Note the steep leading edges in 4/6 cases.
Figure 3: Top six GOBP gene sets downregulated in virus-infected cells. Again, nucleobase metabolism is the main target. We see cell cycle genes are downregulated too. All the leading edges are sharper.
Figure 4: Reactome non-thresholded GSEA overview. A similar pattern is seen as before, with more gene sets being significantly downregulated in the infected cells.
Figure 5: Top six Reactome gene sets upregulated in virus-infected cells. Unlike for our thresholded analysis, no muscular or cardiac pathways are found in the top results. These results are interesting because they are associated specifically with neurons and synapses, not just incidentally as part of actin-remodeling complexes. Note the neurexin / neuroligin gene set, and the L1CAM axon-specific receptor interaction. These are found in the thresholded analysis too, but not so strongly associated. Interesting metabolism and environment interaction gene sets are seen too, involved in PI metabolism and with ECM organization.
Figure 6: Top six Reactome gene sets downregulated in virus-infected cells. Nothing much novel is discovered here. We see rRNA pathway interaction, but this could be nonspecific RNA metabolism elements interacting.
Figure 7: WikiPathways non-thresholded GSEA overview. Unlike before, more sets are significantly upregulated in infected cells, and the disparity between upregulated and downregulated sets is much greater.
Figure 8: Top six WikiPathways gene sets upregulated in virus-infected cells. As per above, these are all significiantly-upregulated genes. We see some cardiomyocyte-associated gene sets, alongside previously-encountered actin and synaptic gene sets.
Figure 9: Top six WikiPathways gene sets downregulated in virus-infected cells. Note that only the first two are significantly downregulated. In addition to mRNA processing and the cell cycle, we see miRNA regulation, a liver-disease associated gene set, and immune components.
Figure 10: Expanded upregulated Reactome gene sets. This emphasizes the findings of the summary depicted above: a greater focus on specifically neuronal genesets, on novel metabolic sets (appearing additionally as phospholipid metabolism, and as PIP synthesis), and generally on broad classes of metabolism (see rows 7 and 13).
Compared to our thresholded GSEA, we see broadly similar results, but some of the isolated findings of our thresholded study are more clearly evident in this analysis. However some interesting differences are evident also.
The most obvious difference in the GO BP data was the upregulation of taxis gene sets (Fig. 2); two such genesets were found in the top six, but they were nowhere to be seen in the thresholded upregulated results for GO BP. The same goes for the cell cycle geneset, which can be found signficantly downregulated according to this analysis, but was not found in the thresholded results (Fig. 3).
In the Reactome data, for our thresholded analysis, we noticed a common thread of actin-related pathways being upregulated. Here, the focus was much greater on neuronal functions specifically (Fig. 5), even outside the top six (Fig. 10). In addition to this, PI metabolism and ECM interaction was upregulated; various other metabolic pathways are also represented in the expanded data. However, the downregulated data was mostly similar (Fig. 6).
We did not show WikiPathways data for the thresholded analysis, because it directly recapitulated the results found with Reactome and GO BP. Here, the notable things are that cardiac functionalities are upregulated (Fig. 8) - which was also seen in the Reactome thresholded analysis, but is more functionally important here - and that a nonfatty liver disease geneset is strongly associated. The apparent significance of the former could be an artifact due to the smaller number of total genesets which were significant for WikiPathways, but the latter is an interesting new association.
The differences listed above are not due to different size limits. If anything, smaller size limits were used for the non-thresholded analysis - 200 vs. 1000 - so, these are truly new associations.
The GSEA results were translated into Cytoscape for analysis.
Cytoscape 3.8.0 was downloaded and installed; the EnrichmentMap Pipeline Collection v1.1.0 was installed also. For simplicity and coverage, the GOBP results (Corresponding to Figs. 1-3) were visualized with an enrichment mapping.
Under “Number of Nodes”, the checkbox was ticked, and an FDR q-value cutoff of 0.25 was used. Under “Number of Edges”, “Data Set Edges” was set to automatic, and connectivity was set to mid-sparse. Otherwise, default parameters were used. The node cutoff was a q-value of 0.25, and the edge cutoff / similarity value was 0.25. The resulting graph had 806 nodes and 3792 edges as per the Cytoscape Analyzer tool.
Figure 11: Raw enrichment graph created by the Cytoscape EnrichmentMap plugin according to the above parameters.
Next, an auto-annotation of the graph was performed. The parameters used to do this are displayed below (Fig. 12).
Figure 12: Parameters used to AutoAnnotate the enrichment graph.
The annotated and labeled enrichment graph was saved in two halves. The main purpose of these figures was to give context to the initial mapping; for legibility, and for analysis, it was decided that the theme map would be relied on.
Figure 13: Annotated top half of the enrichment graph depicted in Figure 11. On the left, note that checkpoint factors and DNA repair is downregulated. At the top, we see various translation, export, and metabolism pathways inhibited in infection. Again, we see many morphogenetic and neurological pathways upregulated, bottom center; and on the right a cluster of ion homeostasis gene sets which would also have a neural function.
Figure 14: Annotated bottom half of the enrichment graph depicted in Figure 11. Points of interest include the strong upregulation of metabolic and transport pathways involved in neuronal activity (top, center left) and the downregulation of the proteasome and apoptotic pathways (top, center right).
To create a theme map, the auto-annotation was repeated as above. Additionally, the “layout network to prevent cluster overlap” box was ticked; the number of annotated clusters was set to 40. After the annotation was performed, the clusters and labels were moved around for improved legibility and spacing. Then, a publication-quality figure was saved exported.
Figure 15: Annotated theme map of GSEA performed on cells infected with Epstein-Barr Virus. Upregulation of neuronal pathways and muscular signaling can be seen; more clearly, cell projection control and motility is also upregulated. Meanwhile, mitotic checkpoint factors are knocked down (left); various RNA processing pathways are as well (center), alongside general ribonucleotide metabolism (top). The theme map recapitulates the broad themes of our findings exactly. Line width and color on cluster demarcator shapes is not meaningful; Cytoscape applied it inconsistently for some reason.
First, we will recap the analysis of our thresholded data, as per Assignment 2. Recall that our positive phenotype is reactivation of Epstein-Barr Virus (EBV).
Initially, we interpreted the neuron-associated gene sets as an accidental appearance, caused by the extreme importance of actin remodeling and transport pathways in neurons (and also in muscles, another pattern that was observed). These were all strongly upregulated in infected cells.
The most important downregulated sets had to do with RNA metabolism and ribosome biogenesis.
Generally the non-thresholded GSEA recapitulates the results of the thresholded analysis. The upregulated results as per GO BP are practically identical (Fig. 2). As for the thresholded analysis, downregulated sets were more statistically significant.
The first major difference we see is that the cell cycle appears at the top of the downregulated GO BP gene sets (Fig. 3); this is a big deal, because we know for a fact that EBV manipulates the cell cycle to ensure infected cells survive as long as possible. (In fact this activity, along with changes like checkpoint removal, allow for oncogenic changes to take place.)
Next, we see differences in the upregulated Reactome genesets (Fig. 5). An extracellular matrix pathway is newly represented, which was not found in the thresholded data. Otherwise, the same genesets are found; they are just more emphasized than in the thresholded data. Protein-protein interactions at synapses, neurexins and neuroligins, and PI metabolism are all present in the thresholded data, but well outside the top six gene sets.
The same pattern is seen for downregulated genes according to Reactome (Fig. 6). We see that more specific gene sets dominate the data. WikiPathways results weren’t shown for the thresholded analysis, but analyzing the results of the nonthresholded WikiPathways analysis (Fig. 8, 9) we see our previous observations recapitulated, with the stark exception that genes associated with non-fatty liver disease are downregulated.
Zeroing in on that last gene set, by looking at the GSEA summary analysis, we can identify . Genes such as BID (featuring death domain), ATF4 (cAMP signaling), and various metabolic genes are involved. The former factors that would be downregulated by a virus avoiding apoptosis, while the latter is covered by the known metabolic changes associated with infection. So this may to be an artifact, realistically, especially considering the lack of breadth found in WikiPathways as opposed to the other two databases used.
Looking at the data more closely, the upshifting of less significant sets in the thresholded data appears to be an artifact caused by the lower size threshold used for this analysis. Thus, we can’t necessarily draw conclusions from relative position changes. Also, while cell cycle is novel in the GO BP data, it previously appeared in the thresholded analysis using the Reactome database. In summary, while some of the results have shifted around within and across databases, our fundamental observations remain unchanged!
This is potentially because the viral modulation is all-encompassing to the point that a non-thresholded analysis has no more subtle effects to unearth, i.e. all affected genesets are sufficiently affected to appear in the thresholded analysis; alternatively, the benefit of the genes included in the non-thresholded analysis was wiped out by the size limit we imposed. The former is more likely, since the GSEAs only had 10-15 fewer significant gene sets for the analysis done with the 15-200 size limit vs the 15-1000 size limit; and these were essentially just broader categories or superfamilies of the genesets that were significant in the 15-200 analysis (just like the thresholded pre- and post-size-limit results).
The main piece of information we can derive from the enrichment map concerns the nature of the transcriptional change. Because we see interconnectivity between upregulated and downregulated major clusters (Fig. 11), even on a sparser graphing setting and with an FDR limit applied, we can deduce that there is a tightly-regulated and broad change in regulation, affecting many genes separately. We would expect something like that from a herpesvirus like EBV; this virus family has a large genome and expresses a lot of nonstructural genes to modulate host response in intricate ways.
Note that the enrichment map doesn’t obviously feature many of our interesting gene sets mentioned previously. This reinforces the fact that the utility of such a visualization lies in the ability to get a broad theme overview, and to gauge the properties of the regulatory changes like we did in the previous paragraph. Enrichment maps are not suitable for deep GSEA analysis; they can be useful for that purpose if specific gene sets / pathways are sought out based on the raw GSEA data (which was one of the options for post-analysis; we picked another).
The upregulation of actin remodeling pathways has been previously noted in the later stages of EBV infection, along with cytoskeletal binding proteins (Price et al.) Adhesion molecules are upregulated on EBV-infected cells (Nanbo et al.); neurotrophic factors, extracellular matrix interactors, and inflammatory cytokines were upregulated by EBV dUTPase, in dendritic cells infected by EBV (Williams et al). Conversely, previous research demonstrated that RNA synthesis and metabolism was downregulated, with EBV proteins taking over crucial roles in transcription and RNA processing for viral RNAs (Park et al).
The main startling contrast with the source paper had to do with the general form of their data. They reported three times more downregulated genes, but our data, which was corrected based on the heat map created at the beginning of A2, reports more upregulated genes.
One would expect a virus infection to result in mass downregulation of genes throughout the cell. This is a puzzling contradiction (and not just a mistake on my part) because the aforementioned similarities with the literature completely match our interpretation, and more confusingly still, our data generally matches the source paper too!
Their analysis may have been changed by the inclusion of more genes. Going all the way back to Assignment 1, a large portion of the genes could not be mapped to symbols; when I sat down to find a common thread, no pattern was obvious in the genes that couldn’t be mapped. How this fact ties in with our similar results in other ways (and our general agreement with the literature) is not clear. It may be due to the lack of a HUGO mapping step on their part; they seem to have used eIDs instead.
As a parting note, the cells used in the analysis were Burkitt’s lymphoma cells - of immune lineage, and the normal cell type where EBV lurks latent in infected individuals - so the expression patterns seen over the course of our analysis are not artefacts of the model.
Knowing that EBV is an oncogenic virus, the April 01 2020 set of drug targets was taken from the Bader lab site, and used to perform a signature analysis. A Mann-Whitney one-sided test was used to determine which drugs overlapped with the gene sets upregulated in the infected phenotype (Fig. 16); the inspiration for this idea was that, potentially, chemotherapy drugs effective against EBV-associated tumors could be evaluated for their mechanism of effect, and if that was not the case other drugs might be found that would block alleviate viral reactivation.
Figure 16: Showing the candidate signature gene sets, alongside the test parameters and ranks to be used for the analysis.
However, EBV being a virus, most studies of EBV inhibition focus on antiviral compounds which specifically interfere with viral proteins, and which wouldn’t easily be detected by a GSEA analysis because they ideally shouldn’t modulate the expression of human genes.
Regardless, two medicines appeared in the list of significantly-overlapping drugs, which have been reported to interact with EBV progression in the past. Dehydroepiandrosterone (DHEA) gave a Mann-Whitney P-score of 0.0346, interacting with 11 genes across all genesets. Melatonin gave a Mann-Whitney P-score of 0.0445, interacting with 10 genes across all genesets.
DHEA is an endogenous hormone which is an intermediate for sex hormone production, but which also binds various receptors throughout the body, and has various functions as a result, including in the immunity. It shows strong associations with the calcium homeostasis gene sets (Fig. 17, bottom signature point). During the AIDS epidemic, DHEA gained attention because of reports that it was protective against oncogenic EBV transformation (Henderson et al.). After the effect was found to be limited to EBV and not useful outside this context, no further research was done.
However, we may be able to explain the observed effect, given our signature analysis. The transformation of cells by EBV is dependent on a calcium influx, which acts as a survival signal to the infected B cell (Dellis et al.), allowing it to develop into a cancer cell. If this pathway is blocked by DHEA, transformation of cells by EBV would be inhibited, and that would explain the protective effect.
Melatonin is a hormone that chiefly regulates neurological activity. Given that neurological phenotypes are strongly upregulated in our infected cells, we may expect that the association arose as a result of numerous interactions with neurologically-involved genesets. Instead, it shows strong associations with transcription templated polymerase gene sets (Fig. 17, top signature point), and weak associations with a number of others.
Melatonin caused a stir in the early 2010s when it was found to be cytotoxic to EBV-associated lymphomas (Paternoster et al.). The association with the transcription templated polymerase cluster suggests that melatonin globally regulates transcription in some way; so it may be the case that melatonin acts to activate pathways which have been shut down by EBV, and which are toxic to the tumor cells once unblocked, potentially by allowing apoptosis to proceed.
Both drugs showed a number of other assorted associations, most notably with an unmarked cluster of SUMOylation genes on the left side of the theme map.
Figure 17: Signature analysis of EBV-associated tumor inhibitors DHEA (bottom) and melatonin (top) in the context of EBV infection.
Assuming these findings are potentially suggestive for future pharmaceutical applications, it is probable that DHEA wouldn’t be suitable as a treatment against EBV given its broad range of roles in the body. However, melatonin is well-tolerated when supplemented; and so this could be an over-the-counter prophylactic for conditions like mononucleosis, or a boost to cancer therapies.
Price AM, Tourigny JP, Forte E, Salinas RE, Dave SS, Luftig MA. 2012. Analysis of Epstein-Barr Virus-Regulated Host Gene Expression Changes through Primary B-Cell Outgrowth Reveals Delayed Kinetics of Latent Membrane Protein 1-Mediated NF- B Activation. J Virol. doi:10.1128/jvi.01069-12.
Park R, Miller G. 2018. Epstein-Barr Virus-Induced Nodules on Viral Replication Compartments Contain RNA Processing Proteins and a Viral Long Noncoding RNA. J Virol. doi:10.1128/jvi.01254-18.
Frey TR, Brathwaite J, Li X, Burgula S, Akinyemi IA, Agarwal S, Burton EM, Ljungman M, McIntosh MT, Bhaduri-McIntosh S. 2020. Nascent transcriptomics reveal cellular pro-lytic factors upregulated upstream of the latency-to-lytic switch protein of Epstein-Barr virus. J Virol. doi:10.1128/jvi.01966-19.
Paternoster L, Radogna F, Accorsi A, Cristina Albertini M, Gualandi G, Ghibelli L. 2009. Melatonin as a modulator of apoptosis in B-lymphoma cells. In: Annals of the New York Academy of Sciences.
Henderson E, Schwartz A, Pashko L, Abou-gharbia M, Swern D. 1981. Dehydroepiandrosterone and 16α-bromo-epiandrosterone: Inhibitors of epstein-barr virus-induced transformation of human lymphocytes. Carcinogenesis. doi:10.1093/carcin/2.7.683.
Dellis O, Arbabian A, Papp B, Rowe M, Joab I, Chomienne C. 2011. Epstein-Barr virus latent membrane protein 1 increases calcium influx through store-operated channels in B lymphoid cells. J Biol Chem. doi:10.1074/jbc.M111.222257.